As a way of showcasing what we have learned in Data Management, Cleaning and Imputation, we were tasked with cleaning and exploring one of the data sets provided to us. Towards that end, we selected the data set pertaining to dogs that are available for adoption in the United States. Using this data, we will attempt to draw some big-picture insights about the population of sheltered dogs in the US.
We used RStudio to clean and explore the data set. To make our work easier to follow, we have broken the steps down into the following:
library("tidyverse") # To help tidy up the data
library("readr") # To import .csv files in a more feature rich way
library("here") # To make it easier to work collaboratively on the project
library("dplyr") # For data/dataframe manipulation
library("usmap") # US map plots
library("ggplot2") # Data visualization
library("prettydoc") # Document themes for R Markdown
library("DT") # used for displaying R data objects (matrices or data frames)
# as tables on HTML pages
The data was provided to us through our class’s Canvas page.
The data is broken down into three .csv files:
The data was collected on 9/20/2019 from various dog adoption organizations across the US.
here::i_am("Final.Rproj")
here() # print the project root that relative paths are resolved against
# load dog_adoptable, dog_descriptions and dog_destination
dog_adoptable <- read_csv("data/raw/dog_adoptable.csv")
dog_descriptions <- read_csv("data/raw/dog_descriptions.csv")
dog_destination <- read_csv("data/raw/dog_destination.csv")
# loading the variable information for each table
dog_adoptable_variable_table <- read_csv("data/raw/dog_adoptable_variable_table.csv")
dog_descriptions_variable_table <- read_csv("data/raw/dog_descriptions_variable_table.csv")
dog_destination_variable_table <- read_csv("data/raw/dog_destination_variable_table.csv")
This data set has a single record per state.
| variable | class | description |
|---|---|---|
| location | character | The full name of the US state or country |
| exported | double | The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location |
| imported | double | The number of adoptable dogs available in this state that originated in a different location |
| Total | double | The total number of adoptable dogs available in a given state. |
| inUS | logical | Whether or not a location is in the US or not. Here, US territories will return |
This data set has a single record per dog.
| variable | class | description |
|---|---|---|
| id | double | The unique identification number for each animal. |
| org_id | character | The unique identification number for each shelter or rescue. |
| species | character | Species of animal. |
| breed_primary | character | The primary (assumed) breed assigned by the shelter or rescue. |
| breed_secondary | character | The secondary (assumed) breed assigned by the shelter or rescue. |
| breed_mixed | logical | Whether or not an animal is presumed to be mixed breed. |
| breed_unknown | logical | Whether or not the animal’s breed is completely unknown. |
| color_primary | character | The most prevalent color of an animal. |
| color_secondary | character | The second most prevalent color of an animal. |
| color_tertiary | character | The third most prevalent color of an animal. |
| age | character | The assumed age class of an animal (Baby, |
| sex | character | The sex of an animal (Female, |
| size | character | The general size class of an animal (Small, |
| coat | character | Coat Length for each animal (Curly, |
| fixed | logical | Whether or not an animal has been spayed/neutered. |
| house_trained | logical | Whether or not an animal is trained to not go to the bathroom in the house. |
| declawed | logical | Whether or not the animal has had its dewclaws removed. |
| special_needs | logical | Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care). |
| shots_current | logical | Whether or not the animal is up to date on all of their vaccines and other shots. |
| env_children | logical | Whether or not the animal is recommended for a home with children. |
| env_dogs | logical | Whether or not the animal is recommended for a home with other dogs. |
| env_cats | logical | Whether or not the animal is recommended for a home with cats. |
| name | character | The animal’s name (as given by the shelter/rescue). |
| tags | character | Any tags given to the dog by the shelter rescue (pipe |
| photo | character | The URL to the animal’s primary photo. |
| status | character | Whether the animal is |
| posted | character | The date that this animal was first listed on a local website. |
| contact_city | character | The rescue/shelter’s listed city. |
| contact_state | character | The rescue/shelter’s listed state. |
| contact_zip | character | The rescue/shelter’s listed zip code. |
| contact_country | character | The rescue/shelter’s listed country. |
| stateQ | character | The state abbreviation queried in the API to return this result. |
| accessed | double | The date that this data was acquired from the PetFinder API. |
| type | character | The type of animal. |
| description | character | The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API. |
This data set has a single record per transfer (of dogs between destinations).
| variable | class | description |
|---|---|---|
| Id | double | The unique identification number for each animal |
| contact_city | character | The rescue/shelter’s listed city |
| contact_state | character | The rescue/shelter’s listed State |
| Description | character | The full description of each animal as entered by the rescue/shelter |
| Found | character | Where the animal was found. |
| Manual | character | . |
| Remove | logical | Animal removed from location |
| still_there | logical | TRUE/FALSE |
After reviewing the data set’s variables and their meanings, we had to decide how to clean them. Rather than cleaning every single variable, we found it worked better to clean the data with the questions we wanted to answer in mind.
Some of the questions we wanted to answer were:
Figuring out what questions we wanted to ask first helped us shave down the number of variables we needed to keep (and clean) in the data set.
# dog_adoptable
# update field inUS to snake_case in_us
dog_adoptable <- rename(dog_adoptable, in_us = inUS)
# filter to just true for in_us
dog_adoptable <- filter(dog_adoptable, in_us == TRUE)
# drop field in_us as all records have same value
dog_adoptable <- select(dog_adoptable, !in_us)
# replace NA with 0 in the numeric count columns (across() supersedes mutate_all())
dog_adoptable <- mutate(dog_adoptable, across(where(is.numeric), ~ replace_na(.x, 0)))
# rename location to state
dog_adoptable <- rename(dog_adoptable, state = location)
# dog_destination (removed)
In addition, the values in the ‘found’ column were inconsistent: it lists countries, counties, cities, and nonsensical values such as ‘Sunday 10am’ or ‘Glaucoma’. As a result, we decided to drop the table completely.
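A quick way to see this kind of inconsistency is to tally the distinct values in a column. The sketch below is only an illustration: `toy_destination` is a hypothetical stand-in for the real table, and apart from ‘Sunday 10am’ and ‘Glaucoma’ its values are made up.

```r
library(dplyr)

# Hypothetical stand-in for dog_destination; only "Sunday 10am" and
# "Glaucoma" are values mentioned in the text, the rest are invented.
toy_destination <- tibble(
  id    = 1:6,
  found = c("Texas", "Sunday 10am", "Texas", "Glaucoma", "Houston", NA)
)

# Tally each distinct value, most frequent first; NA is counted as its own group
found_counts <- count(toy_destination, found, sort = TRUE)
print(found_counts)
```

Scanning the low-frequency rows of such a tally is usually enough to judge whether a free-text column can be salvaged.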
# dog_descriptions
# drop stateQ as it only records the state abbreviation queried in the API
dog_descriptions <- select(dog_descriptions, !stateQ)
# drop status field as all dogs are adoptable
dog_descriptions <- select(dog_descriptions, !status)
# drop species field as all are dogs so adds no value.
dog_descriptions <- select(dog_descriptions, !species)
# drop type field as all are dogs so it adds no value; 3 are NA, but their descriptions confirm they are dogs
dog_descriptions <- select(dog_descriptions, !type)
# drop photo as all are NA
dog_descriptions <- select(dog_descriptions, !photo)
# drop name: free text that is not needed for our questions
dog_descriptions <- select(dog_descriptions, !name)
# drop tags: not needed for our questions
dog_descriptions <- select(dog_descriptions, !tags)
# drop description: free text that is not needed for our questions
dog_descriptions <- select(dog_descriptions, !description)
# drop declawed as all are NA
dog_descriptions <- select(dog_descriptions, !declawed)
# keep only rows where contact_country is "US" (some rows hold a state or zip here by error), then drop the column
dog_descriptions <- filter(dog_descriptions, contact_country == "US")
dog_descriptions <- select(dog_descriptions, !contact_country)
# parse the posting and access dates (accessed is parsed as day/month/year)
dog_descriptions <- mutate(dog_descriptions, posted_date = as.Date(posted))
dog_descriptions <- mutate(dog_descriptions, accessed_date = as.Date(accessed, "%d/%m/%Y"))
# days between the first listing and the date the data was pulled
dog_descriptions <- mutate(dog_descriptions, days_in_shelter =
  as.numeric(difftime(accessed_date, posted_date, units = "days")))
# drop the raw and intermediate date columns
dog_descriptions <- select(dog_descriptions, !c(posted, posted_date, accessed, accessed_date))
# fix missing zips: all NA rows are in Boston, 02108 (quoted so the leading zero is kept)
dog_descriptions <- mutate(dog_descriptions,
                           contact_zip = replace(as.character(contact_zip), is.na(contact_zip), "02108"))
# pad with zeros on the left so every zip has 5 characters
dog_descriptions <- mutate(dog_descriptions,
                           zip = str_pad(string = contact_zip,
                                         width = 5,
                                         side = "left",
                                         pad = "0"))
# rename to city, state, and zip (zip replaces contact_zip)
dog_descriptions <- select(dog_descriptions, !contact_zip)
dog_descriptions <- rename(dog_descriptions, city = contact_city)
dog_descriptions <- rename(dog_descriptions, state = contact_state)
# state abbreviation to full names
state.abb.and.name <- tibble(state.abb, state.name)
# left join for state information
dog_descriptions <- left_join(dog_descriptions, state.abb.and.name, by = c("state" = "state.abb"))
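One thing worth checking after a lookup join like this is which rows found no match. R's built-in `state.abb` covers the 50 states only, so an abbreviation such as "DC" would come back with a missing `state.name`. The sketch below uses a hypothetical `toy_dogs` table to show one way to list unmatched rows; it is not a claim about what the real data contains.

```r
library(dplyr)

# state.abb and state.name are built-in R vectors covering the 50 states
state_lookup <- tibble(state.abb, state.name)

# Hypothetical abbreviations, including "DC", which is not in state.abb
toy_dogs <- tibble(id = 1:3, state = c("MA", "TX", "DC"))

# anti_join keeps the rows of toy_dogs that have no match in the lookup
unmatched <- anti_join(toy_dogs, state_lookup, by = c("state" = "state.abb"))
print(unmatched)
```

If the real table returned any rows here, they would need a manual name before state-level summaries or maps.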
In this section we renamed and mutated some variables.
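To make the two main mutations concrete, here is a small sketch on made-up rows. The `toy` table and its values are hypothetical, and the day/month/year format for `accessed` is an assumption taken from the format string used in our code.

```r
library(dplyr)
library(stringr)

# Hypothetical rows: one zip that lost its leading zero, one intact, one missing
toy <- tibble(
  contact_zip = c("2108", "85249", NA),
  posted      = c("2019-09-01", "2019-09-19", "2019-08-21"),
  accessed    = c("20/09/2019", "20/09/2019", "20/09/2019")
)

toy <- mutate(
  toy,
  # fill the missing zip, then left-pad every zip to 5 characters
  contact_zip = replace(contact_zip, is.na(contact_zip), "02108"),
  zip = str_pad(contact_zip, width = 5, side = "left", pad = "0"),
  # whole days between the listing date and the access date
  days_in_shelter = as.numeric(difftime(
    as.Date(accessed, format = "%d/%m/%Y"),
    as.Date(posted),
    units = "days"
  ))
)
print(select(toy, zip, days_in_shelter))
```

Note that `days_in_shelter` measures time since the listing was posted, which is only a proxy for how long a dog has actually been sheltered.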
# breed_secondary
dog_descriptions$breed_secondary[is.na(dog_descriptions$breed_secondary)] <- "NONE / UNKNOWN"
# color_primary
dog_descriptions$color_primary[is.na(dog_descriptions$color_primary)] <- "OTHER"
# color_secondary
dog_descriptions$color_secondary[is.na(dog_descriptions$color_secondary)] <- "NONE / OTHER"
# color_tertiary
dog_descriptions$color_tertiary[is.na(dog_descriptions$color_tertiary)] <- "NONE / OTHER"
# coat
dog_descriptions$coat[is.na(dog_descriptions$coat)] <- "OTHER"
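The same fills could also be written as a single call with `tidyr::replace_na()`, which takes a named list of replacement values. This is only an alternative sketch on a hypothetical `toy` table, not the code we ran.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows with gaps in three of the columns filled above
toy <- tibble(
  breed_secondary = c("Labrador Retriever", NA),
  color_primary   = c(NA, "Black"),
  coat            = c("Short", NA)
)

# One replacement value per column, applied in a single step
toy_filled <- replace_na(toy, list(
  breed_secondary = "NONE / UNKNOWN",
  color_primary   = "OTHER",
  coat            = "OTHER"
))
print(toy_filled)
```

Keeping the replacements in one list makes it easier to see at a glance which label each column uses for missing values.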